{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overfit-generalization-underfit\n",
    "\n",
    "In the previous notebook, we presented the general cross-validation framework\n",
    "and how it helps us quantify the training and testing errors as well as their\n",
    "fluctuations.\n",
    "\n",
    "In this notebook, we put these two errors into perspective and show how they\n",
    "can help us know if our model generalizes, overfits, or underfits.\n",
    "\n",
    "Let's first load the data and create the same model as in the previous\n",
    "notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "housing = fetch_california_housing(as_frame=True)\n",
    "data, target = housing.data, housing.target\n",
    "target *= 100  # rescale the target in k$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "regressor = DecisionTreeRegressor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overfitting vs. underfitting\n",
    "\n",
    "To better understand the generalization performance of our model and maybe\n",
    "find insights on how to improve it, we compare the testing error with the\n",
    "training error. Thus, we need to compute the error on the training set, which\n",
    "is possible using the `cross_validate` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import cross_validate, ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)\n",
    "cv_results = cross_validate(\n",
    "    regressor,\n",
    "    data,\n",
    "    target,\n",
    "    cv=cv,\n",
    "    scoring=\"neg_mean_absolute_error\",\n",
    "    return_train_score=True,\n",
    "    n_jobs=2,\n",
    ")\n",
    "cv_results = pd.DataFrame(cv_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cross-validation used the negative mean absolute error. We transform the\n",
    "negative mean absolute error into a positive mean absolute error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = pd.DataFrame()\n",
    "scores[[\"train error\", \"test error\"]] = -cv_results[\n",
    "    [\"train_score\", \"test_score\"]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "scores.plot.hist(bins=50, edgecolor=\"black\")\n",
    "plt.xlabel(\"Mean absolute error (k$)\")\n",
    "_ = plt.title(\"Train and test errors distribution via cross-validation\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By plotting the distribution of the training and testing errors, we get\n",
    "information about whether our model is over-fitting, under-fitting (or both at\n",
    "the same time).\n",
    "\n",
    "Here, we observe a **small training error** (actually zero), meaning that the\n",
    "model is **not under-fitting**: it is flexible enough to capture any\n",
    "variations present in the training set.\n",
    "\n",
    "However the **significantly larger testing error** tells us that the model is\n",
    "**over-fitting**: the model has memorized many variations of the training set\n",
    "that could be considered \"noisy\" because they do not generalize to help us\n",
    "make good prediction on the test set.\n",
    "\n",
    "## Validation curve\n",
    "\n",
    "We call **hyperparameters** those parameters that potentially impact the\n",
    "result of the learning and subsequent predictions of a predictor. For example:\n",
    "\n",
    "- the number of neighbors in a k-nearest neighbor model;\n",
    "\n",
    "- the degree of the polynomial.\n",
    "\n",
    "Some model hyperparameters are usually the key to go from a model that\n",
    "underfits to a model that overfits, hopefully going through a region were we\n",
    "can get a good balance between the two. We can acquire knowledge by plotting a\n",
    "curve called the validation curve. This curve can also be applied to the above\n",
    "experiment and varies the value of a hyperparameter.\n",
    "\n",
    "For the decision tree, the `max_depth` hyperparameter is used to control the\n",
    "tradeoff between under-fitting and over-fitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "import numpy as np\n",
    "from sklearn.model_selection import ValidationCurveDisplay\n",
    "\n",
    "max_depth = np.array([1, 5, 10, 15, 20, 25])\n",
    "disp = ValidationCurveDisplay.from_estimator(\n",
    "    regressor,\n",
    "    data,\n",
    "    target,\n",
    "    param_name=\"max_depth\",\n",
    "    param_range=max_depth,\n",
    "    cv=cv,\n",
    "    scoring=\"neg_mean_absolute_error\",\n",
    "    negate_score=True,\n",
    "    std_display_style=\"errorbar\",\n",
    "    n_jobs=2,\n",
    ")\n",
    "_ = disp.ax_.set(\n",
    "    xlabel=\"Maximum depth of decision tree\",\n",
    "    ylabel=\"Mean absolute error (k$)\",\n",
    "    title=\"Validate curve for decision tree\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The validation curve can be divided into three areas:\n",
    "\n",
    "- For `max_depth < 10`, the decision tree underfits. The training error and\n",
    "  therefore the testing error are both high. The model is too constrained and\n",
    "  cannot capture much of the variability of the target variable.\n",
    "\n",
    "- The region around `max_depth = 10` corresponds to the parameter for which\n",
    "  the decision tree generalizes the best. It is flexible enough to capture a\n",
    "  fraction of the variability of the target that generalizes, while not\n",
    "  memorizing all of the noise in the target.\n",
    "\n",
    "- For `max_depth > 10`, the decision tree overfits. The training error becomes\n",
    "  very small, while the testing error increases. In this region, the models\n",
    "  create decisions specifically for noisy samples harming its ability to\n",
    "  generalize to test data.\n",
    "\n",
    "Note that for `max_depth = 10`, the model overfits a bit as there is a gap\n",
    "between the training error and the testing error. It can also potentially\n",
    "underfit also a bit at the same time, because the training error is still far\n",
    "from zero (more than 30 k\\\\$), meaning that the model might still be too\n",
    "constrained to model interesting parts of the data. However, the testing error\n",
    "is minimal, and this is what really matters. This is the best compromise we\n",
    "could reach by just tuning this parameter.\n",
    "\n",
    "Be aware that looking at the mean errors is quite limiting. We should also\n",
    "look at the standard deviation to assess the dispersion of the score. For such\n",
    "purpose, we can use the parameter `std_display_style` to show the standard\n",
    "deviation of the errors as well. In this case, the variance of the errors is\n",
    "small compared to their respective values, and therefore the conclusions above\n",
    "are quite clear. This is not necessarily always the case."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is noise?\n",
    "\n",
    "In this notebook, we talked about the fact that datasets can contain noise.\n",
    "\n",
    "There can be several kinds of noises, among which we can identify:\n",
    "\n",
    "- measurement imprecision from a physical sensor (e.g. temperature);\n",
    "- reporting errors by human collectors.\n",
    "\n",
    "Those unpredictable data acquisition errors can happen either on the input\n",
    "features or in the target variable (in which case we often name this label\n",
    "noise).\n",
    "\n",
    "In practice, the **most common source of \"noise\" is not necessarily a\n",
    "real noise**, but rather **the absence of the measurement of a relevant\n",
    "feature**.\n",
    "\n",
    "Consider the following example: when predicting the price of a house, the\n",
    "surface area will surely impact the price. However, the price will also be\n",
    "influenced by whether the seller is in a rush and decides to sell the house\n",
    "below the market price. A model will be able to make predictions based on the\n",
    "former but not the latter, so \"seller's rush\" is a source of noise since it\n",
    "won't be present in the features.\n",
    "\n",
    "Since this missing/unobserved feature is randomly varying from one sample to\n",
    "the next, it appears as if the target variable was changing because of the\n",
    "impact of a random perturbation or noise, even if there were no significant\n",
    "errors made during the data collection process (besides not measuring the\n",
    "unobserved input feature).\n",
    "\n",
    "One extreme case could happen if there where samples in the dataset with exactly\n",
    "the same input feature values but different values for the target variable. That\n",
    "is very unlikely in real life settings, but could be the case if all features\n",
    "are categorical or if the numerical features were discretized or rounded up\n",
    "naively. In our example, we can imagine two houses having the exact same\n",
    "features in our dataset, but having different prices because of the (unmeasured)\n",
    "seller's rush.\n",
    "\n",
    "Apart from this extreme case, it's hard to know for sure what should qualify or\n",
    "not as noise and which kind of \"noise\" as introduced above is dominating. But in\n",
    "practice, the best way to make our predictive models robust to noise is to\n",
    "avoid overfitting models by:\n",
    "\n",
    "- selecting models that are simple enough or with tuned hyper-parameters as\n",
    "  explained in this module;\n",
    "- collecting a larger number of labeled samples for the training set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary:\n",
    "\n",
    "In this notebook, we saw:\n",
    "\n",
    "* how to identify whether a model is generalizing, overfitting, or\n",
    "  underfitting;\n",
    "* how to check influence of a hyperparameter on the underfit/overfit tradeoff."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}